Optimal kernel choice for large-scale two-sample tests

Authors

Arthur Gretton

Bharath K. Sriperumbudur

Dino Sejdinovic

Heiko Strathmann

Sivaraman Balakrishnan

Massimiliano Pontil

Kenji Fukumizu

Abstract

Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
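The linear-cost statistic and power-driven kernel choice described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: a Gaussian kernel, a small grid of candidate bandwidths instead of an optimization over kernel parameters, and a selection criterion equal to the ratio of the statistic to its estimated standard deviation (under a normal approximation, a larger ratio gives higher power at a fixed level). The function names are hypothetical. The statistic averages a four-term kernel quantity over disjoint pairs of observations, so each observation is read once and only running sums are stored.

```python
import math
from statistics import NormalDist


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel between two points given as equal-length sequences."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def linear_time_mmd(xs, ys, bandwidth):
    """Single pass over both samples; returns (statistic, std of h, number of h terms).

    Each term h uses two fresh points from xs and two from ys, so the cost is
    linear in the sample size and memory use is constant.
    """
    n, mean, m2 = 0, 0.0, 0.0
    xs, ys = iter(xs), iter(ys)
    # Passing the same iterator twice to zip yields consecutive disjoint pairs.
    for x1, x2, y1, y2 in zip(xs, xs, ys, ys):
        h = (gaussian_kernel(x1, x2, bandwidth)
             + gaussian_kernel(y1, y2, bandwidth)
             - gaussian_kernel(x1, y2, bandwidth)
             - gaussian_kernel(x2, y1, bandwidth))
        # Welford's running update of the mean and sum of squared deviations.
        n += 1
        delta = h - mean
        mean += delta / n
        m2 += delta * (h - mean)
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return mean, std, n


def select_bandwidth(xs, ys, candidates, eps=1e-8):
    """Assumed criterion: pick the bandwidth maximizing statistic / std."""
    def ratio(bandwidth):
        stat, std, _ = linear_time_mmd(xs, ys, bandwidth)
        return stat / (std + eps)
    return max(candidates, key=ratio)


def mmd_test(xs, ys, bandwidth, level=0.05):
    """Reject p = q when the standardized statistic exceeds the normal quantile."""
    stat, std, n = linear_time_mmd(xs, ys, bandwidth)
    if std == 0.0:
        return False, 0.0
    z = stat * math.sqrt(n) / std
    threshold = NormalDist().inv_cdf(1.0 - level)
    return z > threshold, z
```

In this sketch the bandwidth should be selected on one portion of the data and the test run on a separate portion, for example `bw = select_bandwidth(x_train, y_train, [0.25, 1.0, 4.0])` followed by `mmd_test(x_test, y_test, bw)`. Reusing the same observations for both steps would make the normal threshold unreliable.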


Similar resources

B-tests: Low Variance Kernel Two-Sample Tests

A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of p...


Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), which is about testing ...


Investigating the Impact of Response Format on the Performance of Grammar Tests: Selected and Constructed

When constructing a test, an initial decision is choosing an appropriate item response format, which can be classified as selected or constructed. In large-scale tests where time and cost are of concern, selected-response items, known as multiple-choice items, are widely used. This study aimed at investigating the impact of response format on the performance of structure tests. Concurren...


Determination of optimal bandwidth in upscaling process of reservoir data using kernel function bandwidth

Upscaling based on the bandwidth of the kernel function is a flexible approach to upscaling data because cells are coarsened according to variability. The degree of coarsening in this method can be controlled with the bandwidth. In a region of smooth variability, a large number of cells will be merged; conversely, in a region of severe variability, cells will remain fine. Bandwidth variatio...


Exponentially Consistent Kernel Two-Sample Tests

Given two sets of independent samples from unknown distributions P and Q, a two-sample test decides whether to reject the null hypothesis that P = Q. Recent attention has focused on kernel two-sample tests, as the test statistics are easy to compute, converge fast, and have low bias in their finite-sample estimates. However, there is still no exact characterization of the asymptotic perform...



Publication date: 2012
